pandas¶pandas is a package built on top of numpy. The core object of this package is the DataFrame, a two-dimensional dataset with row and column labels. In addition to offering an interface for labeled data, pandas also has powerful data manipulation functions.
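To make this concrete, here is a minimal sketch of what a labeled DataFrame looks like (the names and values below are invented purely for illustration):

```python
import pandas as pd

# A DataFrame pairs a 2-D block of values with row and column labels.
# Toy data, for illustration only.
people = pd.DataFrame(
    {"age": [22, 35, 58], "city": ["Lyon", "Oslo", "Cairo"]},
    index=["ana", "bob", "cara"],   # explicit row labels
)
print(people.shape)    # (3, 2): three labeled rows, two labeled columns
print(people["age"])   # select a column by its label
```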
Installation¶pandas depends on numpy, which both commands below will install automatically. To install pandas, use:
# Installing using pip
pip install pandas
# Installing using conda
conda install pandas
# If you don't have pandas installed
!pip install pandas
After installation, you can import it. The version can be checked as follows:
# Import libraries
import pandas as pd
import numpy as np
# Pandas Version
pd.__version__
pandas objects¶from IPython.display import Image
Image(filename='base_01_pandas_5_0.png', width=500)
# Resource: https://bookdata.readthedocs.io/en/latest/base/01_pandas.html
Let's work with the famous Titanic dataset. Here are the available fields:
passengerId: A numerical ID assigned to each passenger
survived: Whether the passenger survived (1) or didn't (0)
pclass: The class the passenger was in
name: The name of the passenger
sex: The gender of the passenger - male or female
age: The age of the passenger (includes fractional values)
sibsp: The number of siblings and spouses the passenger had on board
parch: The number of parents and children the passenger had on board
ticket: The ticket number of the passenger
fare: How much the passenger paid for the ticket
cabin: Which cabin the passenger was in
embarked: Where the passenger boarded the Titanic
# Load the file
titanic = pd.read_csv("Datasets/train.csv")
titanic.index
titanic.columns
for i in titanic.columns:
    print(i)
type(titanic.PassengerId)
type(titanic)
titanic.values
loc and iloc¶Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular way to retrieve slices of the data.
The loc attribute allows indexing and segmenting that always refers to the explicit index. The iloc attribute allows indexing and segmenting that always refers to the implicit Python-style index. To illustrate the difference, let's look at the examples below. Notice that loc is inclusive of both ends of specified ranges, and that columns are specified by name. On the other hand, iloc is exclusive of the upper end of specified ranges, and columns are specified by numerical index.
titanic.loc[10:15,'PassengerId':'Age']
titanic.iloc[10:15,0:5]
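The distinction is easiest to see on a Series whose explicit index differs from the implicit positions (a toy example, separate from the Titanic data):

```python
import pandas as pd

# An explicit integer index that does not match the positions makes
# the loc/iloc difference unambiguous.
s = pd.Series(["a", "b", "c"], index=[1, 3, 5])
print(s.loc[1])      # explicit label 1 -> 'a'
print(s.iloc[1])     # implicit position 1 -> 'b'
print(s.loc[1:3])    # label slice, both ends included -> 'a', 'b'
print(s.iloc[1:3])   # position slice, upper end excluded -> 'b', 'c'
```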
Now that we know a bit about how pandas indexes data, we can look at how to extract it. pandas has a number of ways of selecting subsets of a dataset:
# slicing by implicit index
titanic[10:15]
# fancy indexing
titanic[10:15][['PassengerId','Age']]
# masking
titanic[(titanic.Age > 18) & (titanic['Age'] < 50)][10:15]
titanic['Age'][10:15]
DataFrame¶from IPython.display import Image
Image(filename='AnatomyDataFrame.png', width=500)
#Resource: https://cvw.cac.cornell.edu/PyDataSci1/arrays_dataframes
concat()¶pandas has a function, pd.concat(), whose syntax is similar to np.concatenate's but which accepts different arguments. It can be used for a simple concatenation of Series or DataFrames:
df1 = titanic[(titanic.Age > 18) & (titanic['Age'] < 50)][10:15][['PassengerId','Survived']]
df1.reset_index(inplace = True)
#df1
df2 = titanic.iloc[10:15,2:5]
df2.reset_index(inplace = True)
#df2
pd.concat([df1, df2],sort=True, axis=1)
pd.concat([df1, df2],ignore_index=True,sort=True, axis=0)
merge() and join()¶These functions allow data from different dataframes to be combined into one according to a "crossover" or "search" rule.
The merge() function is the default pandas function for merging data. It's basically pandas' counterpart of SQL JOINs, and requires specifying which columns of both DataFrames will be compared. merge() doesn't care at all about the indices defined on them.
On the other hand, join() is basically doing a merge but taking advantage of the indices of both DataFrames.
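As a quick sketch of join() on made-up frames (separate from the staff/student data used below):

```python
import pandas as pd

# join() merges on the index by default; column and index values
# here are invented for illustration.
left = pd.DataFrame({"role": ["Grader", "Director"]}, index=["James", "Kelly"])
right = pd.DataFrame({"school": ["Business", "Law"]}, index=["James", "Mike"])

print(left.join(right))               # left join on the index (the default)
print(left.join(right, how="inner"))  # only index labels present in both
```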
The following figure summarizes the four types of JOINs: inner, outer, left, and right.

The merge() function is also available as a method in the DataFrame class. The basic syntax is:
new_joined_df = df.merge(another_df, left_on="col_in_df", right_on="col_in_another_df",
                         how="inner"|"left"|"right"|"outer")
The first argument (another_df), as well as left_on and right_on, are required. left_on specifies a column name in df whose values must match those of the column in another_df specified by right_on. The how argument is optional and defaults to "inner".
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
{'Name': 'Sally', 'Role': 'Course liaison'},
{'Name': 'James', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
{'Name': 'Mike', 'School': 'Law'},
{'Name': 'Sally', 'School': 'Engineering'}])
student_df = student_df.set_index('Name')
pd.merge(staff_df, student_df,left_index=True, right_index=True)
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()
pd.merge(staff_df, student_df, how="left", left_on="Name", right_on="Name")
pd.merge(staff_df, student_df, how="left",on="Name")
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
{'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liaison'},
{'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
{'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
{'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering'}])
pd.merge(staff_df, student_df, how='inner', left_on=['First Name','Last Name'], right_on=['First Name','Last Name'])
An essential part of analyzing lots of data is creating efficient summaries of it. Computational aggregations, such as sum(), mean(), median(), min() and max(), allow us to use relatively few numbers to give us an idea of the nature of a potentially large dataset.
In this section, we will explore aggregations in pandas, from simple operations similar to what we have seen in numpy arrays, to more sophisticated operations based on the concept of groupby().
titanic['Age'].mean()
titanic.groupby('Sex').size()
titanic.groupby('Sex').count()
prueba = titanic.groupby('Sex').agg({'Age':['min','mean','max']})
# .loc is an indexer, not a method - use brackets to select from the grouped result
prueba.loc['female']
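The dict-of-lists agg above produces hierarchical (MultiIndex) columns. A common follow-up, sketched here on toy data rather than the Titanic frame, is to flatten them into single-level names:

```python
import pandas as pd

# Toy frame standing in for the Titanic data.
df = pd.DataFrame({"Sex": ["f", "f", "m"], "Age": [22.0, 30.0, 40.0]})
summary = df.groupby("Sex").agg({"Age": ["min", "mean", "max"]})
print(summary.columns.tolist())   # [('Age', 'min'), ('Age', 'mean'), ('Age', 'max')]

# Join the two levels into single strings like 'Age_min'
summary.columns = ["_".join(col) for col in summary.columns]
print(summary.columns.tolist())   # ['Age_min', 'Age_mean', 'Age_max']
```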
crosstab()¶One of the useful functions in pandas is crosstab(). This allows us to segregate our data into buckets using the classes of a few categorical variables, then construct a table showing the number of data points in each bucket:
pd.crosstab(titanic['Sex'], titanic['Pclass'])
These crosstabs are just like pivot tables in Excel, which pandas also provides through pivot_table(). Here, we use it to look at the average age of passengers, segregated by class and gender:
titanic.pivot_table(values='Age',index='Sex',columns='Pclass', aggfunc=np.mean)
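pivot_table() also accepts a margins=True argument that appends an 'All' row and column of totals; here is a small sketch on invented data mirroring the Titanic columns:

```python
import pandas as pd

# Toy data with the same column names as the Titanic example.
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male"],
    "Pclass": [1, 2, 1, 2],
    "Age": [30.0, 20.0, 40.0, 50.0],
})
table = df.pivot_table(values="Age", index="Sex", columns="Pclass",
                       aggfunc="mean", margins=True)
print(table)  # includes an 'All' row and column with the overall means
```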
Working with dates¶pandas also has some functionality for working with dates:
import seaborn as sns
# Load dataset
flights = sns.load_dataset('flights')
flights.head(2)
flights['date'] = flights['year'].astype(str)+ ' ' + flights['month'].astype(str)
flights.head(2)
flights['date'] = pd.to_datetime(flights['date'])
flights.head(2)
flights.set_index('date',inplace=True)
flights.groupby(pd.Grouper(freq='Q')).agg({'passengers':'mean'}).plot()
flights.resample('y').mean()
flights.index.day_name()  # day_name() replaces .weekday_name, which was removed in newer pandas
flights.resample('y').mean()['passengers'].diff(2)
flights['1950-12-31':].head()
flights.reset_index(inplace=True)
flights['weekday_name'] = flights['date'].dt.day_name()
flights.groupby('weekday_name').size()
flights['weekday_name'] = pd.Categorical(flights['weekday_name'], categories=
['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'],
ordered=True)
flights.groupby('weekday_name').size()
As we go through the dataset, it is crucial to know some of the functions that pandas provides to make our lives easier. Here, we want to identify the types of data in our dataset, whether we have nulls, etc. Some functions that can help with this include:
df.head()
df.tail()
df.info()
df.shape
df.columns
df.describe()
df.value_counts()
df.unique()
df.nunique()
# See first two rows
titanic.head(2)
# See last two rows
titanic.tail(2)
# How many rows and columns in dataset
titanic.shape
# Data types and amount of null values in dataset
titanic.info()
# Another way to see null values per column
titanic.isnull().sum()
# Another way to see data types
titanic.dtypes
# Information about numeric columns
titanic.describe()
# Information about categorical columns
titanic.describe(include = ['O'])
# Unique values per column excluding null
titanic.nunique()
# Unique values per column including null
titanic.nunique(dropna=False)
# Counting
titanic.count()
# Most frequent value in the series
titanic['Embarked'].mode()
# Unique values of the column 'Embarked'
titanic['Embarked'].unique()
# How many records do we have by category of the column 'Embarked' without null values
titanic['Embarked'].value_counts()
# How many records do we have by category of the column 'Embarked' with null values
titanic['Embarked'].value_counts(dropna=False)
# Names of the columns in the dataset
titanic.columns
titanic.columns = ['Passenger_Id', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
titanic = titanic.rename(columns={'Passenger_Id':'PassengerId'})
titanic.head()
?titanic.rename
titanic[(titanic['Embarked']=='S') & (titanic['Sex']=='female')][['Name','Age']].head()
Using the Sex, Age, and Fare columns, answer the following questions:
# Your code ...
This is what you should get for #1:
male 577
female 314
Name: Sex, dtype: int64
Double-click Here to see the solution.
# Your code ...
This is what you should get for #2:
0 24.0
dtype: float64
Double-click Here to see the solution.
# Your code ...
This is what you should get for #3:
0 24.0
dtype: float64
Double-click Here to see the solution.
# Your code ...
This is what you should get for #4:
0 19.0
1 25.0
2 28.0
dtype: float64
Double-click Here to see the solution.
# Your code ...
This is what you should get for #5:
0 8.05
dtype: float64
Double-click Here to see the solution.
# Your code ...
This is what you should get for #6:
Common ticket price for women :7.75
Common ticket price for men :8.05
Double-click Here to see the solution.
Now that we have an idea of what our dataset consists of, let's transform it to prepare it for the modeling phase. The types of activities we may engage in during pre-processing include dropping irrelevant columns, handling null values, renaming columns, and creating new features, among others.
titanic.columns
# Delete columns PassengerId, Name, Cabin, we are not going to delete the ticket column yet.
titanic.drop(['PassengerId','Name','Cabin'],axis=1, inplace=True)
titanic.shape
# Null validation again
titanic.isnull().sum()
# In the first 10 records there is a null Age value; keep it in mind as a reference for what we do next
titanic.head(10)
Depending on the problem, nulls can be treated in different ways: deleting the affected rows, filling with a fixed value, filling with the mean, median, or mode, forward/backward filling, or interpolating. You will learn when to use each approach in a later case.
# Delete
titanic.dropna(inplace=True)
titanic.shape
# Null validation
titanic.isnull().sum()
# The rows that had a null Age value are now gone from the first 10 records
titanic.head(10)
titanic.info()
titanic.reset_index(drop=True,inplace=True)
titanic.head(10)
# Leave a fixed value that differentiates them
titanic = pd.read_csv("Datasets/train.csv")
titanic.drop(['PassengerId','Name','Cabin'],axis=1, inplace=True)
titanic.shape
titanic.fillna(-1).head(10)
titanic['Age'].fillna(titanic['Age'].mean()).head(10)
titanic['Age'].fillna(titanic['Age'].median()).head(10)
titanic['Age'].fillna(titanic['Age'].mode()[0]).head(10)
titanic['Age'].fillna(method='backfill').head(10)
titanic['Age'].fillna(method='ffill').head(10)
titanic['Age'].interpolate(method='polynomial', order=2).head(10)
For our exercise we will make the decision to drop the nulls.
# Drop nulls
titanic.dropna(inplace=True)
titanic.shape
To be practical, we normally modify the column names to remove blank spaces and special characters, and to convert uppercase to lowercase:
# Rename columns to lowercase
titanic.rename(columns = lambda col: col.lower(),inplace=True)
titanic.columns
# Validate dataset
titanic.info()
titanic.shape
# After eliminating the nulls, the indexes must be reset, as you can see they still appear from 0 to 890
# even though we have 712 rows.
titanic.reset_index(drop=True,inplace=True)
titanic.info()
titanic.groupby(['sex','survived']).size()
titanic.groupby(['sex','survived']).agg({'fare':'count','age':'mean'}).reset_index()
# Validate duplicate tickets
df_tickets = titanic.groupby('ticket').size().reset_index().rename(columns={0:'count'})
df_tickets.sort_values(by='count',ascending=False)
titanic[titanic['ticket']=='347082']
titanic[titanic['ticket']=='110413']
# Sort the number of records by tickets
df_tickets.sort_values(by=["count"],ascending=False,inplace=True)
df_tickets.head()
# Let's validate a ticket
titanic[titanic['ticket']=='347082']
# First we will add the column 'count' from df_tickets to our titanic dataset
titanic = titanic.merge(df_tickets,left_on='ticket',right_on='ticket',how='left')
titanic.shape
titanic.head()
We will enrich this dataset by creating a new column, family, which will be 1 if the trip was taken with others and 0 if the passenger traveled alone:
titanic['family'] = [1 if count>1 else 0 for count in titanic['count']]
titanic.head()
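The list comprehension above can also be written in vectorized form; this is just a stylistic alternative, not the original code:

```python
import pandas as pd

# Stand-in for titanic['count']: how many passengers share each ticket.
counts = pd.Series([1, 2, 3, 1])

# Compare, then cast the boolean result to int: True -> 1, False -> 0.
family = (counts > 1).astype(int)
print(family.tolist())   # [0, 1, 1, 0]
```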
# Remember functions? We are going to create a function that will help us clean text, in this case the ticket field
import re

def clean_text(string):
    regex = '[^A-Za-z0-9]+'
    return re.sub(regex, '', string)
titanic['ticket'] = titanic['ticket'].apply(clean_text)
titanic.head()
# To continue we will delete the ticket column
titanic.drop('ticket',axis=1,inplace=True)
titanic.head()
Sometimes, it is helpful to create a column for each possible category of one of our categorical variables. For this, we use a pandas method called get_dummies():
titanic['sexDummies'] = titanic['sex']
titanic = pd.get_dummies(titanic, columns = ['sexDummies'], prefix = ['D'])
titanic.head()
1. Group by gender (sex) and calculate the average age (age).
2. Group by family (family) and count the number of records per group.
3. Group by family and survived and count the number of records per group.

# Your code ...
This is what you should get for #1:
| _ | sex | age |
|---|---|---|
| 0 | female | 27.915709 |
| 1 | male | 30.726645 |
Double-click Here to see the solution.
# Your code ..
This is what you should get for #2:
| _ | family | count |
|---|---|---|
| 0 | 0 | 547 |
| 1 | 1 | 344 |
Double-click Here to see the solution.
# Your code ...
This is what you should get for #3:
| _ | family | survived | count |
|---|---|---|---|
| 0 | 0 | 0 | 384 |
| 1 | 0 | 1 | 163 |
| 2 | 1 | 0 | 165 |
| 3 | 1 | 1 | 179 |
Double-click Here to see the solution.
To conduct descriptive analysis of our data, we make use of various graphs to gather more information that we will use later on in the data science process when building our models. It is essential to have knowledge of the Python libraries that allow us to generate these visualizations, such as matplotlib and seaborn. pandas also has a few visualization methods.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
titanic.head()
# Let's start with a simple graph; when x is not specified, each record's position is used as x.
plt.figure(figsize=(6,3))
plt.plot(titanic['age'])
plt.title('Titanic: Plotting Age')
plt.xlabel('id')
plt.ylabel('age')
plt.show()
?plt
plt.figure(figsize=(6,4))
sns.distplot(titanic['age'])  # deprecated in seaborn >= 0.11; use sns.histplot instead
plt.title('Titanic: Age Histogram')
plt.show()
# If we only want the density function
plt.figure(figsize=(6,3))
sns.kdeplot(titanic['age'],shade=True)
plt.title('Titanic: Age Density')
plt.xlabel('age');
g = sns.FacetGrid(titanic, col="survived", height=3)
g.map(sns.kdeplot, 'age', shade=True)
g.despine(left=True,bottom=True);
g = sns.FacetGrid(titanic, col='survived', row='pclass', hue='sex', height=3)
g.map(sns.kdeplot, 'age', shade=True).add_legend()
g.despine(left=True, bottom=True)
plt.show()
plt.figure(figsize=(6,3))
sns.boxplot(titanic['age'])#,orient='v')
plt.title('Titanic Age Boxplot')
plt.show()
plt.figure(figsize=(6,4))
sns.boxplot(x='survived',y='age',data=titanic)
plt.title('Titanic Boxplot of survivors and age')
plt.show()
plt.figure(figsize=(10,4))
sns.boxplot(x='sex',y='age',hue='survived',data=titanic)
plt.title('Titanic Boxplot of survivors, gender and age')
plt.show()
plt.figure(figsize=(6, 4))
sns.countplot(x='survived', data=titanic)
plt.show()
plt.figure(figsize=(6, 4))
sns.barplot(y='age',x='pclass',data=titanic)
plt.show()
plt.figure(figsize=(6, 4))
sns.barplot(y='survived',x='sex',hue='pclass',data=titanic)
plt.show()
plt.figure(figsize=(6, 4))
sns.barplot(y='pclass',x='sex',hue='family',data=titanic)
plt.show()
titanic.drop(['survived','family','D_female','D_male','count'],axis=1).corr()
plt.figure()
sns.heatmap(titanic.drop(['survived','family','D_female','D_male','count'],axis=1).corr(),annot=True, linewidth=0.5,fmt='.1f')
plt.show()
plt.figure()
sns.scatterplot(x='age', y='fare', hue='sex', style='survived', data=titanic)
plt.show()
plt.figure()
sns.scatterplot(x='pclass', y='age', hue='sex', style='survived',size='count', data=titanic)
plt.legend(loc='upper left')
plt.show()
plt.figure()
sns.scatterplot(x='age', y='fare',data=titanic)
plt.show()
type(titanic.columns)
for col in titanic.columns:
    print(col)
!pip install plotly
import plotly.express as px
fig = px.scatter(titanic,x='age',y='fare', color='sex',size='count')
fig.show()